Abstract Normalized Compression Distance (NCD) is
a popular tool that uses compression algorithms to cluster
and classify data in a wide range of applications.
Existing discussions of NCD’s theoretical merit rely on
certain theoretical properties of compression algorithms.
However, we demonstrate that many popular compression
algorithms do not appear to satisfy these properties.
We explore the relationship between some of
these properties and file size, demonstrating that this
theoretical problem is actually a practical problem for
classifying malware with large file sizes, and we then
introduce some variants of NCD that mitigate this problem.

1 Introduction 

In the era of big data, techniques that allow for data
understanding without domain expertise enable more
rapid knowledge discovery in the sciences and beyond.
One technique that holds such promise is the Normalized
Compression Distance (NCD) [14], which is a similarity
measure that operates on generic file objects,
without regard to their format, structure, or semantics.

NCD approximates the Normalized Information Distance,
which is universal for a broad class of similarity
measures. Specifically, the NCD measures the distance
between two files via the extent to which one can be
compressed given the other, and can be calculated using
standard compression algorithms.

NCD and its open source implementation, CompLearn [5],
have been widely applied for clustering, genealogy,
and classification in a wide range of application
areas. Its creators originally demonstrated its application
in genomics, virology, languages, literature, music,
character recognition, and astronomy [7]. Subsequent
work has applied it to plagiarism detection [4], image
distinguishability [18], machine translation evaluation
[?], database entity identification [?], detection of
internet worms [?], malware phylogeny [?], and malware
classification [?], to name a few.

Assuming some simple properties of the compression
algorithm used, the NCD has been shown to be, in
fact, a similarity metric [7]. However, it remains to be
seen whether real-world compression algorithms actually
satisfy these properties, particularly in the domain
of large files. As data storage has become more affordable,
large files have become more common, and the
ability to analyze them efficiently has become imperative.
Music recommendation systems work with MP3s,
which are typically several megabytes in size, medical
images may be up to 30 MB or more [9], and computer
programs are often more than 100 MB in size.

This paper explores the relationship between file size 
and the behavior of NCD, and proposes modifications 
to NCD to improve its performance on large files. 

Section 2 provides an introduction to NCD and the
compression algorithm axioms that have been used for
proving it to be a similarity metric. Section 3 explores
the extent to which several popular (and not-so-popular)
compression algorithms satisfy these axioms and
investigates the impact of file size on the effectiveness
of NCD for malware classification. Finally, Section 4
proposes two possible adaptations of the NCD definition, for the
purpose of improving its performance on large files,
and demonstrates significant performance improvement
with several compressors on a malware classification
problem.

2 NCD Background


The motivating idea behind the Normalized Compression
Distance is that the similarity of two objects can
be measured by the ease with which one can be transformed
into the other. This notion is captured formally
by the information distance, E(X, Y), between two
strings, X and Y, which is the length of the shortest
program that can compute Y from X or X from Y in some
fixed programming language. The information distance
generalizes the notion of Kolmogorov complexity, where
K(X) is the length of the shortest program that computes
X, and intuitively captures a very general notion
of what it means for two objects to be similar.

However, for the purposes of computing similarity,
it is important that distances be relative. Two long
strings that differ in a single character should be considered
more similar than two short strings that differ
in a single character. This leads to the definition of the
Normalized Information Distance (NID),

$$\mathrm{NID}(X, Y) = \frac{E(X, Y)}{\max\{K(X), K(Y)\}}$$

The NID has several nice features: it satisfies the
conditions of a metric up to a finite additive constant,
and it is universal, in the sense that it minorizes every
upper semi-computable similarity distance [7]. However,
it is also incomputable, which is a serious obstacle.

Given a compression algorithm, C, K(X, Y) can, in
some sense, be approximated by C(XY), the result of
compressing with C the file consisting of X concatenated
with Y, and NID(X, Y) can, in turn, be approximated
by


$$\mathrm{NCD}(X, Y) = \frac{|C(XY)| - \min\{|C(X)|, |C(Y)|\}}{\max\{|C(X)|, |C(Y)|\}}$$


However, in order to prove that NCD is a similarity
metric, [7] placed several restrictions on the compression
algorithm. A compression algorithm satisfying the
conditions below is said to be a normal compressor.


Here, C(X) denotes the string X' resulting from the
application of compressor C to string X, XY denotes
the concatenation of X and Y, and |X| denotes the
length of string (or file) X.
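As an illustration of the NCD formula, the following is a minimal sketch in Python. It assumes that the standard-library bindings for zlib, bz2, and lzma can stand in for the compressors discussed in this paper (PPMZ has no standard-library binding and is omitted). The names `ncd` and `COMPRESSORS` and the toy inputs are hypothetical and are not taken from CompLearn.

```python
import bz2
import lzma
import random
import zlib

# A compressor C is modeled as any function from bytes to compressed bytes.
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def ncd(x: bytes, y: bytes, compress) -> float:
    """NCD(X, Y) = (|C(XY)| - min(|C(X)|, |C(Y)|)) / max(|C(X)|, |C(Y)|)."""
    cx, cy = len(compress(x)), len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs (assumptions for illustration): two near-identical texts
# and one unrelated block of pseudo-random bytes.
text_a = b"the quick brown fox jumps over the lazy dog " * 50
text_b = b"the quick brown fox jumps over the lazy cat " * 50
noise = random.Random(0).getrandbits(8 * 2000).to_bytes(2000, "big")

for name, compress in COMPRESSORS.items():
    similar = ncd(text_a, text_b, compress)
    dissimilar = ncd(text_a, noise, compress)
    print(f"{name:5s} similar={similar:.3f} dissimilar={dissimilar:.3f}")
```

For each compressor, the similar pair should score noticeably lower than the dissimilar pair; the absolute values depend on the compressor's fixed overhead and are not comparable across compressors.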

The question remains whether existing compression
algorithms satisfy these axioms, particularly in the domain
of large files. While NCD has apparently been
quite successful in practice, the majority of applications
have been on relatively small files. (See section [?].)
Notably, music applications [?] used MIDI files rather
than the more common, and much larger, MP3 format.

Previous work [3] explored the NCD distance from a
file to itself (which is closely related to the idempotence
axiom) for bzip, zlib, and PPMZ on the Calgary Corpus
[22], comprising 14 files, the largest of which is under 1
MB. The following section explores these axioms on a
larger and more representative dataset and investigates
the practical impact of deviations from normality.


3 Application of NCD to Large Files 

3.1 Normality of Compression Algorithms 

The definition of a normal compressor deals with asymptotic
behavior, allowing for an O(log n) discrepancy
in the axioms of idempotence, monotonicity, symmetry,
and distributivity. Thus, in theory, experimental
validation (or refutation) of these axioms is not truly
feasible: perhaps the behavior changes when the file
size is beyond that of the largest file in our experiment.
Nonetheless, we endeavor to experimentally explore
these axioms more extensively than has been done
in prior work.

Data We combined the traditional Calgary Corpus with
the Large and Standard Canterbury Corpora, as well
as the Silesia Corpus.^ The latter contains files of size
ranging from 6 MB to 51 MB, greatly expanding the
size distribution over the previous corpora.


Normal Compression A normal compressor, C, as defined
in Definition 3.1 in [7], is one that satisfies the following,
up to an additive O(log n) term, where n is the
largest length of an element involved in the (in)equality
concerned:

- Idempotence: $|C(XX)| = |C(X)|$ and $|C(\lambda)| = 0$,
  where $\lambda$ is the empty string.

- Monotonicity: $|C(XY)| \ge |C(X)|$.

- Symmetry: $|C(XY)| = |C(YX)|$.

- Distributivity:
  $|C(XY)| + |C(Z)| \le |C(XZ)| + |C(YZ)|$.
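To make the four axioms concrete, the sketch below measures how far a compressor departs from each of them on a single triple of inputs and prints the result next to log2(n). It is an illustration under stated assumptions, not the experimental setup used in this paper: zlib from the Python standard library is the compressor, the inputs are pseudo-random bytes rather than corpus files, and the helper names `clen`, `axiom_gaps`, and `random_bytes` are hypothetical.

```python
import math
import random
import zlib


def clen(data: bytes) -> int:
    """|C(X)| for C = zlib at its highest compression level."""
    return len(zlib.compress(data, 9))


def axiom_gaps(x: bytes, y: bytes, z: bytes) -> dict:
    """Slack in each axiom; a normal compressor keeps each within O(log n)."""
    return {
        # |C(XX)| - |C(X)| should be about 0.
        "idempotence": clen(x + x) - clen(x),
        # |C(empty)| should be 0.
        "empty_string": clen(b""),
        # |C(X)| - |C(XY)| should not be positive.
        "monotonicity": clen(x) - clen(x + y),
        # | |C(XY)| - |C(YX)| | should be about 0.
        "symmetry": abs(clen(x + y) - clen(y + x)),
        # |C(XY)| + |C(Z)| - |C(XZ)| - |C(YZ)| should not be positive.
        "distributivity": clen(x + y) + clen(z) - clen(x + z) - clen(y + z),
    }


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible pseudo-random (hence incompressible) test data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "big")


x, y, z = (random_bytes(100_000, seed) for seed in (1, 2, 3))
gaps = axiom_gaps(x, y, z)
log_term = math.log2(len(x + x))
for axiom, gap in gaps.items():
    print(f"{axiom:15s} gap={gap:8d}  log2(n)={log_term:.1f}")
```

On incompressible 100 KB inputs the idempotence gap is on the order of |X| itself, far beyond any logarithmic allowance, which mirrors the behavior reported for zlib on larger files.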


Idempotence Figures 1 and 2 show the difference in
the sizes of C(X) and C(XX), and log(|XX|), for a
representative subset of files X in the dataset, with C
ranging over the compression algorithms bzip2 [?], lzma
[?], PPMZ [2], and zlib [?]. Indeed, bz2 and zlib quite
apparently fail the idempotence axiom, with |C(XX)|
growing much faster than |C(X)|, and a factor of
log(|XX|) unable to put a dent in the difference. While
PPMZ and lzma appear significantly better, this
value still grows much faster than log(|XX|), as is apparent in
Figure 2. We see that lzma makes a large jump around
8 MB (but even before that, its growth is much larger
than the log function).

^ These are standard corpora for the evaluation of compression
algorithms and are available at
http://www.data-compression.info/Corpora/

Fig. 1 Idempotence on compression corpora: |C(XX)| - |C(X)| as compared to log(|XX|) versus |XX|.

Symmetry Figure 3 shows the magnitude of the difference
between |C(XY)| and |C(YX)|. While in most cases,
at this scale, this was bounded by log(|XY|) (and in all
cases by a small constant factor thereof), the asymptotic
behavior is unclear, as values for all four compressors
spike wildly. This is likely due to the fact that
the extent of the symmetry is highly dependent on the
compressibility and similarity of the two files involved.
Both zlib and lzma look quite promising for symmetry, while
the asymptotic behavior of PPMZ and bz2 is not
discernible.

Distributivity and Monotonicity Initial experiments with 
distributivity and monotonicity did not give cause for 
concern. 


Our experiments have shown serious violation of the
idempotence axiom that has been used to prove theoretical
properties of NCD, leaving a potential gap between
theory and practice. The next section explores
the extent to which NCD can be useful in spite of this
gap.

3.2 Classification using NCD with Abnormal 
Compressors 

We have demonstrated that none of the compression
algorithms we explored satisfy the requirements for normal
compression. The question remains whether this
contraindicates their use with NCD. As mentioned above,
much previous work has demonstrated NCD’s utility
with some of these compression algorithms in applications
with small file sizes. However, the compressors’
deviation from normality grows with file size. Do they
remain useful with larger files?

To address this question, we explored the accuracy
of NCD in identifying the malware family of APK files
from the Android Malware Genome Project dataset [23].
In particular, we took a subset of 500 samples from
the Geinimi, DroidKungFu3, DroidKungFu4, and GoldDream
families.^ Geinimi samples in this dataset have
size up to 14.1 MB, DroidKungFu3 up to 15.4 MB,
DroidKungFu4 up to 11.2 MB, and GoldDream up to
6.4 MB.

Fig. 2 Idempotence on compression corpora: Enlargement of a portion of the graph in Figure 1 to more clearly show the
behavior for smaller files.

We evaluated the NCD with the same four compression
algorithms as above, using a nearest neighbor
classifier [8] with a single (randomly selected) instance
of each malware family in the reference set.^ Note that
we intentionally restricted the reference set to make the
classification problem difficult in order to explore the
limitations of the compression algorithms when used
with NCD. Results are shown in Figure 4. In spite of
clearly violating the idempotence property, both lzma
and PPMZ performed significantly better than random
guessing. In line with their relative normality, lzma
performed best, at 59.7%, with PPMZ next at 44.4%.
Although bz2 is slightly closer to satisfying the idempotence
property than zlib, zlib actually outperformed
bz2, albeit not by much, with accuracies of 33.3% and
29.8%, respectively, with neither performing much better
than random guessing.

^ We selected these families due to their containing enough
samples to allow for a meaningful test, and containing large
enough files to challenge the compressors.

^ For readers unfamiliar with nearest neighbor classification,
specifically we classified a "test" sample by looking at
the distance between it and each of the "reference" samples,
and selecting the family of the nearest (i.e. most similar)
reference sample.
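The nearest neighbor procedure with one reference sample per family can be sketched as follows. This is a minimal illustration rather than the evaluation code behind the reported accuracies: it assumes zlib as the compressor, uses short toy byte strings in place of APK files, and the names `classify_1nn`, `references`, and the family labels are hypothetical.

```python
import zlib
from typing import Dict


def ncd(x: bytes, y: bytes) -> float:
    """Traditional NCD with zlib as the compressor C."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify_1nn(sample: bytes, references: Dict[str, bytes]) -> str:
    """Return the label whose single reference sample is nearest under NCD."""
    return min(references, key=lambda label: ncd(sample, references[label]))


# One reference per family, as in the restricted setting described above.
# The contents are toy stand-ins, not malware samples.
references = {
    "family_text": b"the quick brown fox jumps over the lazy dog " * 40,
    "family_digits": b"0123456789 9876543210 31415926535 " * 40,
}
test_sample = b"a quick brown fox jumped over the lazy dogs " * 40

print(classify_1nn(test_sample, references))
```

With real files, `references` would map each family name to the bytes of its randomly selected reference file, and accuracy would be the fraction of remaining samples assigned to their true family.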

To demonstrate the relevance of file size, we performed
the same test with one slight change, this time
using only reference samples smaller than 200 KB. We
saw drastic improvement with bz2 (now 75.4%), lzma
(82.5%), and PPMZ (66.7%), while zlib’s performance
actually got worse (29.2%).

Finally, looking only at files smaller than 200 KB
yielded improved performance by bz2 (89.7%), zlib
(37.9%), and PPMZ (75.9%), but lzma actually performed
slightly worse (75.9%). The latter suggests that
file size is not the only factor that can inhibit the performance
of a compression algorithm with NCD. Notably,
bz2 outperformed lzma on these files. These results are
shown in Figure 5.

Fig. 3 Symmetry: The difference between |C(XY)| and |C(YX)|, as compared to log(|XY|).

Fig. 4 Accuracy of NCD in identifying Android malware
family, using a 1-NN classifier



4 Adapting NCD to Handle Large Files 

We saw in Section 3.2 that NCD has widely varying performance
on large files, depending on the compression
algorithm used. The memory limitations of the algorithm
are key here. The major hurdle is to effectively
use information from string X for the compression of
string Y in computing C(XY). Algorithms like bz2 and
zlib have an explicit block size as a limiting factor; if
|X| > block_size, then there is no hope of benefiting
from any similarity between X and Y. In contrast, lzma
doesn’t have a block size limitation, but instead has a
finite dictionary size; as it processes its input, the dictionary
grows. Once the dictionary is full, it is erased
and the algorithm starts with an empty dictionary at
whatever point it has reached in its input. Again, if this
occurs before reaching the start of Y, hope of detecting
any similarity between X and Y is lost. Likewise, even
if X is small, but Y is large, with the portion of Y that
is similar to X appearing well into Y, the similarity
can’t be detected.
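The effect of such a limit can be observed directly by computing the NCD of a file with itself. The sketch below is an illustration under stated assumptions: it uses zlib, whose DEFLATE format only allows back-references within a 32 KB window, and pseudo-random inputs whose only redundancy is the repetition of X itself. The helper names `self_ncd` and `random_bytes` are hypothetical.

```python
import random
import zlib


def self_ncd(x: bytes) -> float:
    """NCD(X, X) = (|C(XX)| - |C(X)|) / |C(X)|, with C = zlib."""
    cx = len(zlib.compress(x, 9))
    cxx = len(zlib.compress(x + x, 9))
    return (cxx - cx) / cx


def random_bytes(n: int, seed: int) -> bytes:
    """Reproducible pseudo-random (hence incompressible) test data."""
    return random.Random(seed).getrandbits(8 * n).to_bytes(n, "big")


small = random_bytes(10_000, seed=1)    # fits inside the 32 KB window
large = random_bytes(200_000, seed=2)   # far exceeds the 32 KB window

print(f"NCD(X, X), |X| = 10 KB:  {self_ncd(small):.3f}")
print(f"NCD(X, X), |X| = 200 KB: {self_ncd(large):.3f}")
```

When X fits in the window, the second copy is encoded as references to the first and the distance is close to 0. When X is much larger than the window, the second copy can no longer see the first, and the distance from X to itself approaches 1.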

Thus, it seems logical that we could improve the effectiveness
of NCD by bringing similar parts of X and Y
in closer proximity of one another; rather than computing
NCD using C(XY), we propose using C(J(X, Y)),
where J is some method of combining strings X and Y.
So, we define

$$\mathrm{NCD}_{C,J}(X, Y) = \frac{|C(J(X, Y))| - \min\{|C(X)|, |C(Y)|\}}{\max\{|C(X)|, |C(Y)|\}}$$

In the original definition of NCD, J is simply concatenation.
In an ideal world, J would locate similar chunks
of X and Y and place them adjacently. However, if J is
too destructive of the original strings, much of the original
compression of X and Y individually will be lost,
resulting in a higher overall value for NCD_{C,J}(X, Y).
Thus, we want these similar chunks to be as large as
possible so as to still allow both chunks to fit within the
block size, or to allow processing of them both within
the same dictionary. There are some simple ways to
achieve this.

Fig. 5 Effect of file size on accuracy of NCD in identifying Android malware family, using a 1-NN classifier

One approach would be to apply a string alignment
algorithm to X and Y, and combine the two strings
so that aligned segments are located in sufficient proximity.
However, while Hirschberg’s algorithm [?] allows
for such alignment to be performed in linear space, thus
eliminating memory issues, it takes time proportional to
the product of the file sizes and is thus quite slow with
large files. Further, this is limited to finding a very specific
type of similarity, which is order-dependent. Instead,
we propose two other approaches inspired by this
notion.

Interleaving The simplest approach is to assume that
similar parts of X and Y are similarly located, and just
weave them together in chunks of size b. Say
$X = x_1 x_2 \ldots x_n$ and $Y = y_1 y_2 \ldots y_m$, where $|x_i| = |y_j| = b$
for $1 \le i \le n-1$ and $1 \le j \le m-1$, $0 < |x_n| \le b$, and
$0 < |y_m| \le b$. Then define

$$J(X, Y) = \begin{cases} x_1 y_1 x_2 y_2 \cdots x_n y_n y_{n+1} \cdots y_m & \text{if } n < m \\ x_1 y_1 x_2 y_2 \cdots x_m y_m x_{m+1} \cdots x_n & \text{otherwise} \end{cases}$$
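The interleaving combiner and the corresponding NCD variant can be sketched as below. This is a minimal illustration assuming zlib as the compressor and pseudo-random test data; the function names `chunks`, `interleave`, and `ncd_with` are hypothetical, and the block size of 10,000 bytes is chosen only so that a chunk pair fits inside zlib's 32 KB window.

```python
import random
import zlib
from typing import Callable, List


def chunks(data: bytes, block: int) -> List[bytes]:
    """Split data into consecutive pieces of `block` bytes (last may be shorter)."""
    return [data[i:i + block] for i in range(0, len(data), block)]


def interleave(x: bytes, y: bytes, block: int) -> bytes:
    """J(X, Y) = x1 y1 x2 y2 ..., followed by the leftover chunks of the longer input."""
    xs, ys = chunks(x, block), chunks(y, block)
    shared = min(len(xs), len(ys))
    woven = [part for pair in zip(xs, ys) for part in pair]
    return b"".join(woven + xs[shared:] + ys[shared:])


def ncd_with(x: bytes, y: bytes, combine: Callable[[bytes, bytes], bytes]) -> float:
    """NCD variant in which `combine` replaces plain concatenation."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cj = len(zlib.compress(combine(x, y), 9))
    return (cj - min(cx, cy)) / max(cx, cy)


# Small check of the chunk order, with n = 3 chunks of X and m = 4 chunks of Y.
print(interleave(b"AAAABBBBCC", b"xxxxyyyyzzzzww", 4))

# Identical 100 KB inputs: concatenation hides the similarity from zlib,
# while interleaving places matching chunks within reach of each other.
x = random.Random(0).getrandbits(8 * 100_000).to_bytes(100_000, "big")
ncd_concat = ncd_with(x, x, lambda a, b: a + b)
ncd_interleaved = ncd_with(x, x, lambda a, b: interleave(a, b, 10_000))
print(f"concat={ncd_concat:.3f} interleaved={ncd_interleaved:.3f}")
```

The benefit relies on the stated assumption that similar content sits at similar offsets in both files; when it does not, interleaving may instead disrupt compression within each file.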

NCD-shuffle Another approach is to split both strings
into chunks of the desired size (selected to be appropriate
for the compression algorithm) and apply the traditional
NCD to determine the similarity of each chunk of
X to each chunk of Y, and align them accordingly, with
the most similar chunks from the two strings adjacent.
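One possible reading of NCD-shuffle is sketched below. The text does not specify how chunks are paired, so the greedy strategy here (each chunk of X, in order, takes the most similar chunk of Y that is still unused) is an assumption made for illustration, as are the function names `chunks`, `ncd`, and `shuffle_combine` and the choice of zlib as the compressor.

```python
import random
import zlib
from typing import List


def chunks(data: bytes, block: int) -> List[bytes]:
    """Split data into consecutive pieces of `block` bytes (last may be shorter)."""
    return [data[i:i + block] for i in range(0, len(data), block)]


def ncd(x: bytes, y: bytes) -> float:
    """Traditional NCD with zlib, used here to compare individual chunks."""
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def shuffle_combine(x: bytes, y: bytes, block: int) -> bytes:
    """Place each chunk of X next to its most similar unused chunk of Y.

    Greedy pairing is an assumption; leftover chunks of Y are appended at the end.
    """
    xs, ys = chunks(x, block), chunks(y, block)
    unused = list(range(len(ys)))
    out = []
    for x_chunk in xs:
        out.append(x_chunk)
        if unused:
            best = min(unused, key=lambda j: ncd(x_chunk, ys[j]))
            unused.remove(best)
            out.append(ys[best])
    out.extend(ys[j] for j in unused)
    return b"".join(out)


# Toy example: Y holds the same two chunks as X, but in the opposite order.
part_a = random.Random(1).getrandbits(8 * 8_000).to_bytes(8_000, "big")
part_b = random.Random(2).getrandbits(8 * 8_000).to_bytes(8_000, "big")
combined = shuffle_combine(part_a + part_b, part_b + part_a, 8_000)
print(combined == part_a + part_a + part_b + part_b)
```

Unlike interleaving, this pairing does not assume that similar content sits at similar offsets, at the cost of one NCD computation per pair of chunks, which grows with the product of the two chunk counts.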


Table 1 Comparison of performance of different combining 
functions with NCD in a 1-NN classifier for Android malware 
family identification, with varying block sizes (block sizes in 
thousands of KB) 



        concat   IL 1    IL 10   IL 100  IL 1000
bz2     0.298    0.464   0.462   0.456   0.308
zlib    0.333    0.190   0.194   0.131   0.317
lzma    0.597    0.637   0.643   0.635   0.603
PPMZ    0.444    0.357   0.484   0.438   0.442

        concat   NS 10   NS 100  NS 1000
bz2     0.298    0.522   0.423   0.325
zlib    0.333    0.433   0.200   0.325
lzma    0.597    0.641   0.643   0.627
PPMZ    0.444    0.371   0.438   0.435



4.1 NCD Adaptation Results 

Using the original classification problem from Section
3.2, we applied the interleaving (IL) and NCD-shuffle
(NS) file combination techniques with various block sizes
with each of the compression algorithms. As shown in
Table 1 and Figure 6, in all cases, one or both techniques
yielded better performance than the traditional NCD.
Figure 6 also includes the accuracy when 5 representatives
from each family are used for comparison (with
the exclusion of PPMZ, which was too slow for this
experiment). Most notably, these techniques boosted bz2
from 29.8% accuracy to 52.2% accuracy with a single
training sample, and from 55.2% to 75.2% with 5 training
samples, and boosted zlib from 30% to 74.8% with
5 training samples.
































On Normalized Compression Distance and Large Malware

Fig. 6 Traditional NCD compared to the best of the alternative combiners we explored for Android malware family identification


Note that we also performed smaller experiments on
music MP3 data and medical image data, and also saw
improvements there,^ so we expect these techniques to
offer improvement not just in malware classification,
but in all domains where large files are prevalent.

5 Conclusion and Future Directions 

We have demonstrated that several compression algorithms,
lzma, bz2, zlib, and PPMZ, apparently fail to
satisfy the properties of a normal compressor, and explored
the implications of this on their capabilities for
classifying Android malware with NCD. More generally,
we have shown that file size is a factor that hampers
the performance of NCD with these compression algorithms.
Specifically, we found that lzma performs best
on this classification task when files are large (at least
in the range we explored), but that bz2 performs best
when files are sufficiently small. We have also found zlib
to generally not be useful for this task. PPMZ, in spite
of being the top performer in terms of idempotence, did
not come close to the most accurate compressor in any
case.

We introduced two simple file combination techniques
that boost the performance of NCD on large
files with each of these compression algorithms.

However, the challenges of choosing the optimal compression
algorithm and the optimal combination technique
(and its parameters) remain. For supervised
classification applications, it is easy enough to use a test
set to aid in the selection of the technique and block size
parameter for the relevant domain. However, for clustering
or genealogy tasks, the burden remains to study
several resulting clusterings or hierarchies to determine
which is most appropriate.

^ For example, on a set of 66 mammography images from
DDSM [?], zlib improved from 31.3% accuracy to 54.7%
accuracy in identifying cancerous images, and bz2 improved
from 26.6% to 62.5% accuracy.

It remains for future work to better understand what 
properties of a data set make it more or less amenable 
to the different compression algorithms and different 
combination techniques and parameters. 

Nonetheless, these techniques offer enhanced NCD
performance in malware classification (as well as other
tasks) with large files, and suggest that further research
in this direction is worth pursuing.